Skip to content

DO NOT MERGE: ClusterExtensionRevision API Review#2610

Draft
perdasilva wants to merge 1 commit intoopenshift:masterfrom
perdasilva:cer-api-review
Draft

DO NOT MERGE: ClusterExtensionRevision API Review#2610
perdasilva wants to merge 1 commit intoopenshift:masterfrom
perdasilva:cer-api-review

Conversation

@perdasilva
Copy link

@perdasilva perdasilva commented Dec 2, 2025

OLMv1 ClusterExtensionRevision API Review

⚠️ DO NOT MERGE
This PR is only to facilitate API review.

ClusterExtensionRevision API

ClusterExtensionRevision objects are created by the ClusterExtension controller for each install, upgrade, or reconfiguration operation. They are owned by a ClusterExtension. A ClusterExtensionRevision carries the objects of the revision (i.e. the resources of a package at a given version + any user supplied configuration). We don't expect users to create any clusterextensionrevision objects. We only expect users to interact with this API when something goes wrong, or, in the future, in approval flows (i.e. a revision is kept from rolling out until there's some manual intervention by an user).

Lifecycle

  • Active: actively reconciling
  • Archived: inert (only for historical / auditing purposes)

Revision Rollout Strategy

  1. The objects are grouped into phases.
  2. The objects of each phase are applied together and in no particular order.
  3. Rollout will only progress to the next phase once all object of the current phase pass their probes
  4. Probes are attached to kinds and currently come in two flavours:
    1. status condition probes: e.g. Available=True
    2. field equality: e.g. .status.replicas == .status.updatedReplicas
  5. The rollout is complete once all phases are complete

Object ownership is transferred from previous to subsequent revisions.

Once a revision completes, it sets all previous revisions to Archived. When Archived the revision will clean up any objects it still manages.

Collision Control

Phase objects can be be configured to error on existing, adopt orphan resources, or force adopt resources.

Bundle Rollout Strategy Definition

registry+v1 bundles will have their rollout strategy defined by OLM in both phases and probes. If we directly support the Helm format, OLM will likey also define the rollout strategy.

Currently the registry+v1 bundle strategy defined by sorting the bundle manifests into different phases: namespaces, policies, rbac, crds, etc. with default probes, e.g. CRD has Established=True condition, Deployment has Available=True, etc.

Known Upcoming Changes

  • Removal of the Paused lifecycle state from docs and code
  • Removal of the "Migrated" reason on the "Available" condition
  • Moving collisionProtection out of the object and up to the top level

Known Limitations

The objects are currently defined inline in the CRD. In the (not so distant) future we will shard the resources across some kind of container (e.g. ConfigMap or Secret).

Future Work (beyond GA)

  • Revision approval
  • Rollback strategies
  • Paused lifecycle
  • Bundle author provided rollout definition

Open Questions

  • Can we release this API TechPreview without resource sharding and calling out the known limitation. This would require subsequent (possibly breaking) API changes.
  • Should the rollout definition be exposed through the API or just be a clusterextensionrevision controller concern that is opaque to users. Or rather, should we add the probe definitions to the API, or, rather than expose phases just have a "bad of objects"?

Signed-off-by: Per Goncalves da Silva <pegoncal@redhat.com>
@openshift-ci-robot
Copy link

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai
Copy link

coderabbitai bot commented Dec 2, 2025

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 2, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 2, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 2, 2025

Hello @perdasilva! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Dec 2, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 2, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@everettraven
Copy link
Contributor

Following up here - this is in my queue, but I'm working through reading the associated RFE first to ensure I've got full context before reviewing.

I should have an initial review within the next day or so.

Copy link
Contributor

@everettraven everettraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, a handful of comments.

It might be helpful to run the latest version of linter against this API to catch some of the smaller things our linter is good at catching :).

//
// Once a revision is set to "Archived", it cannot be un-archived.
//
// +kubebuilder:default="Active"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a required input instead of defaulting to Active?

What does it mean for multiple ClusterExtensionRevision objects for the same ClusterExtension to be Active?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Active is the current one, Archive are the old states.
CER are like snapshot of the CE in certain period of time

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there can never be more than 1 CER marked as Active?

Also, if this field is required, it shouldn't be defaulted. Required with a default is contradictory.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During an upgrade, while a new ClusterExtensionRevision is being rolled out and before the old revision has begun to be torn down, both CERs will be marked as Active.

I'm in agreement on the defaulting - plus when we create revisions we should be setting it explicitly anyway rather than relying on defaulting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure there is a clear description of what each state means and what it means if you have multiple instances of the Active state.

Comment on lines +99 to +101
// ClusterExtensionRevisionLifecycleStatePaused / "Paused" disables reconciliation of the ClusterExtensionRevision.
// Object changes will not be reconciled. However, status updates will be propagated.
ClusterExtensionRevisionLifecycleStatePaused ClusterExtensionRevisionLifecycleState = "Paused"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used in the API as an allowed value?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Pauses' will be removed for now. Even through the underlying library supports it, it's not needed rn.

// [RFC 1123]: https://tools.ietf.org/html/rfc1123
//
// +kubebuilder:validation:MaxLength=63
// +kubebuilder:validation:Pattern=`^[a-z]([-a-z0-9]*[a-z0-9])?$`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discourage the usage of the pattern marker because of the lack of helpful error messages that are returned to end-users (a regex string instead of a sentence explaining the constraints).

Instead, use a CEL expression that enforces this constraint and returns a helpful message like:

// +kubebuilder:validation:XValidation:rule=`!format.dns1123Label().validate(self).hasValue()`,message="the value must consist of only lowercase alphanumeric characters and hyphens, and must start with an alphabetic character and end with an alphanumeric character."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check, what is the minimum KAS version this API is intended to be installed upon. For 1.35 onwards we will prefer format: k8s-short-name. From 1.32 you can use the formats library. If you're supporting versions older than that you'll have to use the regex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about the new format :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL here as well - we should use format: k8s-short-name instead. I don't think this API will be active on anything less than 1.32, @perdasilva correct me if I'm wrong?

//
// +kubebuilder:validation:MaxLength=63
// +kubebuilder:validation:Pattern=`^[a-z]([-a-z0-9]*[a-z0-9])?$`
Name string `json:"name"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly mark the field as required.

Suggested change
Name string `json:"name"`
// +required
Name string `json:"name"`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the empty string a valid value?

Comment on lines +87 to +88
// +listType=map
// +listMapKey=name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document the uniqueness constraint keyed on the phase name in the GoDoc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain this one?
I did not get the request

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The +listType=map and +listMapKey=name markers enforce a uniqueness constraint on this field based on the name of the phase.

Markers are not included in generated documentation so we strongly recommend making all the input constraints explicitly in the GoDoc. I do not see in the GoDoc that would be used in generated documentation for the field mentioning that phase names must be unique, so I'm asking that it be added.


// collisionProtection controls whether the operator can adopt and modify objects
// that already exist on the cluster.
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We normally try to follow a pattern where we explicitly introduce the allowed values as well with a sentence like:

Allowed values are Prevent, IfNoController, ...

before going into the When ... sections.

//
// +kubebuilder:default="Prevent"
// +kubebuilder:validation:Enum=Prevent;IfNoController;None
// +optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be helpful context, could you elaborate for me why you want this field to be optional?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this field as optional helps protect against accidental misuse; the field needs to be one of these three, and requiring every user to choose one of them opens more opportunities to cause problems from controller conflicts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is it that is creating this resource? My interpretation was that OLM itself would be creating these resources and not end-users - is that incorrect?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's correct, and a human shouldn't need to interact with this API apart from some kind of manual intervention. IMO this also helps for the non-human users, in the sense that it may help prevent bugs that create controller vs controller conflicts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend avoiding defaulting like this when it is a controller that is responsible for creating these resources - it obfuscates whether or not the controller intentionally set a value or if it omitted a value and it defaulted to that value.

If the field is effectively required, make it the responsibility of your controller to always set this value and have your default behavior explicitly encoded in the controller.

// Use this setting with extreme caution as it may cause multiple controllers to fight over
// the same resource, resulting in increased load on the API server and etcd.
//
// +kubebuilder:default="Prevent"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you ever intend to modify the default behavior? Done this way, the defaulting logic becomes a part of your API contract.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally can't imagine us changing the defaulting behavior away from Prevent, which is the safest option of the three.

//
// +kubebuilder:validation:EmbeddedResource
// +kubebuilder:pruning:PreserveUnknownFields
Object unstructured.Unstructured `json:"object"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be wrong here, but IIRC most embedded resource types will use https://pkg.go.dev/k8s.io/kubernetes/pkg/runtime#RawExtension as the type there.

I don't have a strong opinion here though as I've not reviewed many APIs that are embedding another resource blob within them. If this is working, 👍

- jsonPath: .metadata.creationTimestamp
name: Age
type: date
name: v1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v1? Have you shipped this API already? Typically, APIs for experimental type features will start as v1alpha1. In OpenShift, it is expected that TP APIs are v1alphaN until ready to be promoted.

@perdasilva
Copy link
Author

@everettraven Thank you for the thoughtful review. Imma queue up an issue to address what you've identified! =D

@everettraven
Copy link
Contributor

@everettraven Thank you for the thoughtful review. Imma queue up an issue to address what you've identified! =D

Just a note, it may be easiest to get everything to a good state in this temporary PR and then use the diff between what exists in this PR and the upstream APIs for making the changes. That way you don't have to do multiple rounds of iteration.

// - When status is True and reason is ProbesSucceeded, the ClusterExtensionRevision has been successfully rolled out and all objects pass their readiness probes.
// - When status is False and reason is ProbeFailure, one or more objects are failing their readiness probes during rollout.
// - When status is Unknown and reason is Reconciling, the ClusterExtensionRevision has encountered an error that prevented it from observing the probes.
// - When status is Unknown and reason is Archived, the ClusterExtensionRevision has been archived and its objects have been torn down.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like Available = False might be more appropriate?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the convention we're trying to set here is that True vs False implies we're actively checking readiness probes to determine the state of installed content. Once a revision has been Archived, we are no longer checking those probes, thus the availability is Unknown. Technically, the resources could still be available if they haven't been completely garbage collected yet, but since this resource is Archived we're no longer paying attention to its resources.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha - maybe we can do some updating of the text here to clarify that the unknown status is because it is archived and the system is no longer paying attention to the resources associated with this, as opposed to the current wording where it sounds like it is because the resources associated with it are removed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants